Tencent AVS: A Holistic Ads Video Dataset for Multi-Modal Scene Segmentation

Abstract

Temporal video segmentation and classification have been advanced greatly by public benchmarks in recent years. However, such research still mainly focuses on human actions, failing to describe videos in a holistic view. In addition, previous research tends to pay much attention to visual information yet ignores the multi-modal nature of videos. To fill this gap, we construct an ‘Ads Video Segmentation’ dataset (AVS) in the ads domain to escalate multi-modal video analysis to a new level. AVS describes videos from three independent perspectives, ‘presentation form’, ‘place’, and ‘style’, and contains rich multi-modal information such as video, audio, and text. AVS is organized hierarchically in semantic aspects for comprehensive temporal video segmentation, with three levels of categories for multi-label classification, e.g., ‘place’ - ‘working place’ - ‘office’. Therefore, AVS is distinguished from previous temporal segmentation datasets due to its multi-modal information, holistic view of categories, and hierarchical granularities. It includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500 labels. Accompanied with AVS, we also present a strong multi-modal video segmentation baseline coupled with multi-label class prediction. Extensive experiments are conducted to evaluate our proposed method as well as existing representative methods and to reveal key challenges of AVS.
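
The hierarchical, multi-label organization described in the abstract can be illustrated with a minimal Python sketch. Only the path ‘place’ - ‘working place’ - ‘office’ is taken from the abstract; the `PARENT` table and the helpers `expand_label` and `segment_targets` are hypothetical names introduced here for illustration, and the sketch assumes that a segment tagged with a fine-grained class also carries that class's ancestors. It is not the dataset's actual annotation format.

```python
from typing import Dict, List, Set

# Hypothetical child -> parent table. Only this one path
# ('place' > 'working place' > 'office') appears in the abstract.
PARENT: Dict[str, str] = {
    "office": "working place",
    "working place": "place",
}


def expand_label(leaf: str) -> List[str]:
    """Return the category path from the top level down to `leaf`."""
    path = [leaf]
    # Walk upward until a category with no recorded parent is reached.
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def segment_targets(leaves: List[str]) -> Set[str]:
    """Build a multi-label target set: each leaf plus all of its ancestors."""
    targets: Set[str] = set()
    for leaf in leaves:
        targets.update(expand_label(leaf))
    return targets


if __name__ == "__main__":
    print(expand_label("office"))
    print(sorted(segment_targets(["office"])))
```

Under this assumption, one fine-grained annotation yields a label at every level of the hierarchy, which is one way a single segment can end up with several labels.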

Related resources

Multi-Modal Scene Interpretation

The visionary goal of developing an easy-to-use service robot implies several key tasks such as speech understanding, object recognition and scene understanding. Besides the more sensor-oriented capabilities, such systems need extensive meta knowledge, e.g., about mental representations of spatial relations to match the view between man and machine. Only if all parts fit together an unrestricted...

Video Scene Segmentation via Continuous Video Coherence

In extended video sequences, individual frames are grouped into shots which are defined as a sequence taken by a single camera, and related shots are grouped into scenes which are defined by a single dramatic event taken by a small number of related cameras. This hierarchical structure is deliberately constructed, dictated by the limitations and preferences of the human visual and memory system...

Co-inference for Multi-modal Scene Analysis

We address the problem of understanding scenes from multiple sources of sensor data (e.g., a camera and a laser scanner) in the case where there is no one-to-one correspondence across modalities (e.g., pixels and 3-D points). This is an important scenario that frequently arises in practice not only when two different types of sensors are used, but also when the sensors are not co-located and ha...

Multi-Modal Scene Understanding for Robotic Grasping

Current robotics research is largely driven by the vision of creating an intelligent being that can perform dangerous, difficult or unpopular tasks. These can, for example, be exploring the surface of planet Mars or the bottom of the ocean, maintaining a furnace or assembling a car. They can also be more mundane, such as cleaning an apartment or fetching groceries. This vision has been pursued sin...

Video Scene Segmentation with a Semantic Similarity

Video scene segmentation is an important problem in computer vision, as it helps in efficient storage, indexing and retrieval of videos. A significant amount of work has been done in this area in the form of shot segmentation techniques, and they often give reasonably good results. However, shots are not of much importance for the semantic analysis of videos. For semantic and meaningful analysi...

Journal

Journal title: IEEE Access

Year: 2022

ISSN: 2169-3536

DOI: https://doi.org/10.1109/access.2022.3227425